
derivative in the backward propagation. In detail, for the weight of binarized linear layers, the common practice is to redistribute the weight to zero mean to retain representation information [199] and to apply scaling factors to minimize the quantization error [199]. The activation is binarized by the sign function without re-scaling for computational efficiency. Thus, the computation can be expressed as

$$
\text{bi-linear}(\mathbf{X}) = \alpha_w \big(\operatorname{sign}(\mathbf{X}) \otimes \operatorname{sign}(\mathbf{W} - \mu(\mathbf{W}))\big), \qquad \alpha_w = \frac{1}{n}\lVert \mathbf{W} \rVert_1, \tag{5.3}
$$

where $\mathbf{W}$ and $\mathbf{X}$ denote the full-precision weight and activation, $\mu(\cdot)$ denotes the mean value, $\alpha_w$ is the scaling factor for the weight, and $\otimes$ denotes matrix multiplication with bitwise XNOR and bitcount operations. Besides, the quantization of the activation $\mathbf{X}$ in Eq. (5.3) is set to higher bit-widths in some works to boost the performance of binarized BERT [6, 222].
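
The following PyTorch-style sketch illustrates how the forward pass of Eq. (5.3) might be realized. It is a minimal sketch, not a reference implementation: the function name bi_linear and the shapes are illustrative, and a dense matrix multiplication over {-1, +1} tensors stands in for the bitwise XNOR/bitcount kernel used in deployment.

```python
import torch

def bi_linear(X, W):
    """Minimal sketch of Eq. (5.3): forward pass of a binarized linear layer.

    X: full-precision activation of shape (N, n)
    W: full-precision weight of shape (m, n)
    A dense matmul over {-1, +1} tensors stands in for the bitwise
    XNOR/bitcount kernel used in deployment.
    """
    # Redistribute the weight to zero mean before binarization.
    W_centered = W - W.mean()
    B_w = torch.sign(W_centered)          # binary weight in {-1, +1}
    # Scaling factor alpha_w = ||W||_1 / n, with n the number of weight elements.
    alpha_w = W.abs().mean()
    # The activation is binarized by sign without re-scaling.
    B_x = torch.sign(X)
    return alpha_w * (B_x @ B_w.t())
```

During training, the sign function would typically be paired with an approximated derivative (e.g., a straight-through-style estimator) in the backward propagation, as alluded to above; the sketch omits the backward pass.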

The input data first passes through a quantized embedding layer before being fed into the transformer blocks [285, 6]. Each transformer block consists of two main components: the Multi-Head Attention (MHA) module and the Feed-Forward Network (FFN). The computation of MHA depends on queries Q, keys K, and values V, which are derived from the hidden states $\mathbf{H} \in \mathbb{R}^{N \times D}$, where N denotes the length of the sequence and D denotes the dimension of the features. For a specific transformer layer, the computation in an attention head can be expressed as

$$
\mathbf{Q} = \text{bi-linear}_Q(\mathbf{H}), \quad \mathbf{K} = \text{bi-linear}_K(\mathbf{H}), \quad \mathbf{V} = \text{bi-linear}_V(\mathbf{H}), \tag{5.4}
$$

where $\text{bi-linear}_Q$, $\text{bi-linear}_K$, and $\text{bi-linear}_V$ represent three different binarized linear layers for Q, K, and V, respectively. Then the attention score A is computed as follows:

$$
\mathbf{A} = \frac{1}{\sqrt{D}}\left(\mathbf{B}_Q \otimes \mathbf{B}_K^\top\right), \quad \mathbf{B}_Q = \operatorname{sign}(\mathbf{Q}), \quad \mathbf{B}_K = \operatorname{sign}(\mathbf{K}), \tag{5.5}
$$

where $\mathbf{B}_Q$ and $\mathbf{B}_K$ are the binarized query and key, respectively. Note that the obtained attention weight is then truncated by the attention mask, and each row in A can be regarded as a k-dimensional vector, where k is the number of unmasked elements. Then the attention weights $\mathbf{B}^s_A$ are binarized as

$$
\mathbf{B}^s_A = \operatorname{sign}(\operatorname{softmax}(\mathbf{A})). \tag{5.6}
$$
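
Putting Eqs. (5.4)-(5.6) together, a minimal sketch of the binarized attention-weight computation for one head might look as follows. It reuses the bi_linear sketch above; the weight names W_Q, W_K, W_V and the boolean mask convention (True marking positions to ignore) are illustrative assumptions, and a dense matmul again stands in for the XNOR/bitcount kernel.

```python
import math
import torch

def binarized_attention_weights(H, W_Q, W_K, W_V, attention_mask=None):
    """Sketch of Eqs. (5.4)-(5.6): binarized attention weights for one head.

    H: hidden states of shape (N, D).
    W_Q, W_K, W_V: hypothetical full-precision projection weights of shape (D, D).
    attention_mask: optional boolean tensor of shape (N, N); True marks
    positions to be ignored (an assumed convention).
    """
    # Eq. (5.4): three separate binarized linear layers for Q, K, and V.
    Q = bi_linear(H, W_Q)
    K = bi_linear(H, W_K)
    V = bi_linear(H, W_V)  # V is weighted by the attention in the full MHA module

    D = H.shape[-1]
    B_Q = torch.sign(Q)    # binarized query, Eq. (5.5)
    B_K = torch.sign(K)    # binarized key,   Eq. (5.5)
    # Attention score; a dense matmul stands in for XNOR/bitcount.
    A = (B_Q @ B_K.t()) / math.sqrt(D)
    if attention_mask is not None:
        # Truncate masked positions so each row effectively becomes a
        # k-dimensional vector over the unmasked elements.
        A = A.masked_fill(attention_mask, float("-inf"))
    # Eq. (5.6): binarize the attention weights after the softmax.
    B_s_A = torch.sign(torch.softmax(A, dim=-1))
    return B_s_A, V
```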

Despite the appealing properties of network binarization for reducing the massive number of parameters and FLOPs, BERT binarization is technically hard from an optimization perspective. As illustrated in Fig. 5.1, the performance of quantized BERT drops mildly from 32-bit down to as low as 2-bit, i.e., around 0.6% on MRPC and 0.2% on MNLI-m of the GLUE benchmark [230]. However, when reducing the bit-width to one, the performance drops sharply, i.e., 3.8% and 0.9% on the two tasks. In summary, binarization of BERT brings severe performance degradation compared with other weight bit-widths. Therefore, BERT binarization remains a challenging yet valuable task for academia and industry. This section surveys existing works and advances for binarizing pre-trained BERT models.